De novo annotation of lncRNA HOTAIR transcripts by long-read RNA capture-seq reveals a differentiation-driven isoform switch

Background LncRNAs are tissue-specific and emerge as important regulators of various biological processes and as disease biomarkers. HOTAIR is a well-established pro-oncogenic lncRNA which has been attributed a variety of functions in cancer and native contexts. However, a lack of an exhaustive, cell type-specific annotation questions whether HOTAIR functions are supported by the expression of multiple isoforms. Results Using a capture long-read sequencing approach, we characterize HOTAIR isoforms expressed in human primary adipose stem cells. We find HOTAIR isoforms population displays varied splicing patterns, frequently leading to the exclusion or truncation of canonical LSD1 and PRC2 binding domains. We identify a highly cell type-specific HOTAIR isoform pool regulated by distinct promoter usage, and uncover a shift in the HOTAIR TSS usage that modulates the balance of HOTAIR isoforms at differentiation onset. Conclusion Our results highlight the complexity and cell type-specificity of HOTAIR isoforms and open perspectives on functional implications of these variants and their balance to key cellular processes. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08887-w.

LncRNA folding into secondary and tertiary structures dictates their interactome, making their function dependent on structural conservation [20]. While the full-length HOTAIR sequence is poorly evolutionarily conserved, folding prediction has identified two well-conserved structures in the 5' and 3' ends of HOTAIR [21]. The currently reported primary structure of the HOTAIR gene is complex, with 2 predicted promoters, multiple predicted transcription start sites (TSSs) and potential splice sites leading to several isoforms [22][23][24][25]. HOTAIR transcripts can harbor small exon length variations [26], or alternative splice site usage that eliminates the PRC2 [18] or RBP [19] binding domains. Therefore, distinct HOTAIR splice variants likely have distinct functions, warranting an isoform-specific annotation in relevant tissues.
Current reference annotations for lncRNAs are incomplete due to their overall low expression level, weak evolutionary conservation, and high tissue specificity [27]. Identification of lncRNA isoforms using short-read RNA sequencing (RNA-seq) is challenging because almost every exon can be alternatively spliced [28], and short reads cannot resolve the connectivity between distant exons. Long-read sequencing technologies can address this challenge by covering the entire RNA sequence in a single read, enabling mapping of isoform changes that may impact lncRNA structure and function [29].
Here, we combine long-read Capture-seq and Illumina short-read RNA-seq to resolve changes in the composition of HOTAIR isoforms in a well-characterized adipogenic differentiation system [30]. We uncover a temporal shift in the composition of HOTAIR isoforms upon induction of differentiation, regulated by distinct promoter usage and hormonal and nutrient-sensing pathways.

HOTAIR is highly expressed in ASCs and regulated during adipogenesis
We first assessed HOTAIR expression level in ASCs versus cancer cell lines where HOTAIR function has been previously studied [28]. We find that HOTAIR expression level in ASCs from two unrelated donors is higher or comparable to that of cancer cell lines (Fig. 1a), confirming the relevance of primary ASCs as a model system to assess the relative abundance of HOTAIR isoforms.

Identification of main HOTAIR isoforms
To identify HOTAIR isoforms, we performed PacBio single-molecule, long-read isoform sequencing of captured polyadenylated HOTAIR transcripts (PacBio Captureseq) in proliferating ASCs and during adipose differentiation. Full-length reads were clustered into non-redundant transcripts and aligned to the hg38 genome assembly, providing excellent coverage over the HOTAIR locus with sharp exon boundaries (Fig. 2a,b). The Isoseq3 pipeline yielded ~ 6000 HOTAIR isoforms; these were further filtered and merged both across time points and based on internal junctions using Cupcake ToFU [31,32] or TAMA [33] (Fig. 2a), resulting in 34 isoforms (Fig. 2a,c; Additional file 2, Table S1, Additional file 3).
To characterize these HOTAIR isoforms, we first classified transcripts into four main categories using SQANTI [34]. Out of the 34 aforementioned isoforms, we find (i) 8 full splice matches (FSM) transcripts, (ii) 6 incomplete splice matches (ISM) of an annotated (known) transcript, (iii) 12 novel in catalog (NIC) transcripts containing new combinations of annotated splice sites, and (iv) 8 novel not in catalog (NNC) transcripts using at least one unannotated splice site (Fig. 2d, Additional file 1, Fig. S2). Second, we assessed isoform variations at the 5' and 3' ends. Transcripts start sites supported by Cap analysis of gene expression (CAGE) data are distributed evenly across SQANTI categories, with 15 transcripts starting within 15 bp of a CAGE peak summit (Fig. 2e). Transcripts with a TSS located more than 250 bp away from a CAGE peak likely represent lowly expressed or highly cell typespecific isoforms of HOTAIR, explaining the absence of a dedicated CAGE peak [35,36]. HOTAIR full length transcripts also show variable 3' ends (exon 7; E7) corresponding to the alternative usage of 4 different canonical polyA signals for transcription termination (Fig. 2f, Additional file 2, Table S2). Third, we find the highest number of reads for transcripts in FSM, ISM and NIC SQANTI categories (Fig. 2g), indicating that most HOTAIR transcripts identified here have a known exon and splice junction composition.
To identify the top isoforms expressed across differentiation, we further filtered candidates based on read counts (Fig. 2h). Only 23 transcripts accumulate more than 100 reads, and these are also detected in at least 2 samples. These 23 high-confidence isoforms arise from multiple TSS usage, alternative splicing and intron retention events, as well as polyA site usage (Fig. 2i). Altogether, our PacBio sequencing analysis identifies with high confidence known and novel uncharacterized HOTAIR transcript isoforms in our adipose differentiation system, with notable variation in their TSS and polyA site usage.
Inclusion of the LSD1 binding domain in HOTAIR isoforms depends on polyA site usage (see Figs. 3a and 2f,i); thus we examined potential changes in exon E7 length during adipogenesis. The proportion of PacBio reads for the long forms of E7, harboring the LSD1 binding domain, does not significantly vary during differentiation (Fig. 3b). In agreement, the expression profile of LSD1-containing HOTAIR isoforms (E7 long and medium) is similar to that of the total isoform pool (E7 short) (Fig. 3c, see Fig. 1bc), confirming that alternative polyadenylation pattern of HOTAIR is maintained during differentiation (Fig. 3c). Hence, variations in HOTAIR isoforms during adipogenesis do not in principle impact its LSD1 scaffolding function.
We next investigated alternative splicing events affecting the PRC2 binding domain. Analysis of exon E5 alternative splicing reveals two major alternative splice sites (site 1 and site 3) (Fig. 3d). Induction of differentiation promotes splicing at site 3, leading to a slight increase in proportion of the shorter E5 variant (Fig. 3e). This splicing event can be readily detected by semi-quantitative RT-PCR analysis of exon E3-E5 expression in ASCs from two independent individuals (Fig. 3f, Additional file 1, Fig. S3a,b,c), emphasizing the robustness of the PacBio approach in capturing relative variations in HOTAIR transcript levels during differentiation. Overall, PacBio Capture-seq analysis reveals that HOTAIR is expressed as a pool of isoforms with differential ability to bind LSD1 and PRC2. However, adipogenic induction does not significantly affect the proportion of isoforms with LSD1 or PRC2 binding capacity.

Adipogenic induction triggers a switch in HOTAIR isoform start sites
One noticeable feature of HOTAIR isoform pool is the presence of 9 distinct start sites (Fig. 4a, see Fig. 2i). To assess TSS usage, we first quantified the total number of reads for each exon start category in PacBio Capture-Seq dataset (Fig. 4b). Only transcripts starting from exons E2, E3, E3.1 and E5 cumulated more than 500 reads during differentiation. We next quantified the proportion of each of HOTAIR 23 high-confidence isoform at each time point (Fig. 4c). Strikingly, while E3.1-starting isoforms contribute to the majority of reads prior to differentiation onset (Pro, D0), adipogenic induction triggers a decrease in the E3.1-starting isoform pool which is compensated by an increase in E3-starting isoform expression ( Fig. 4c,d,e). Short-read RNA-seq analysis confirms this switch in TSS usage, with exon E3 becoming significantly more expressed than E3.1 upon differentiation onset (D1; Fig. 4f ). The sharp drop in E3.1 starting isoforms is readily observed by semi-quantitative RT-PCR using primers spanning exons E3.1-E4, while expression of E3-E5 containing isoforms is maintained during early adipogenesis (Fig. 4g, left; Additional file 1, Fig. S3b).   Fig. S3a), indicating an adipose-specific regulation of HOTAIR isoforms. We conclude that adipogenic commitment triggers a switch in HOTAIR TSS usage, potentially impacting functional binding domains located in 5' exons [19].

ASCs express a cell type-specific HOTAIR isoform
To assess cell type-and tissue-specificity of E3.1 starting HOTAIR isoforms, we used Snaptron [41], a search engine for querying splicing patterns in publicly available RNA-seq datasets (Additional file 2, Table S3). We found only 24 cell samples with more than 10 reads containing the E3.1-E4 junction, while the E3-E4 junction was detected in 672 samples ( Fig. 5a; Additional file 2, Table S4). We confirmed by semi-quantitative RT-PCR the expression of exon E3.1 in cultured primary myoblasts, BJ fibroblasts and HEK 293 T cells -albeit to lower levels than in ASCs -and its absence in HeLa cells (Fig. 5b).
To gain insight into the regulation of HOTAIR isoform expression, we examined the chromatin accessibility landscape of the HOTAIR locus in published Assay for Transposable Accessible Chromatin (ATAC)-seq data [44][45][46][47] (Fig. 5c). We find two regions of accessible ('open') chromatin in ASCs (R3, R4), which our chromatin immunoprecipitation (ChIP)-seq data show are also enriched in H3K4me3 and H3K27ac, histone modifications characterizing active regulatory sites (Fig. 5c, d). Region R3 coincides with both activating and repressing regulatory element (REM) annotations by EpiRegio [42], suggesting it constitutes a bivalent regulator for HOTAIR or nearby genes expression. Region R4, located immediately upstream of exon E3.1 is also in an 'open' chromatin in myoblasts but not in HeLa cells, which respectively do and do not express E3.1-starting transcripts (Fig. 5b,c), and likely represents the active promoter for E3.1-starting isoforms. In contrast, these regions are in a 'closed' configuration in mature adipocytes, consistent with HOTAIR downregulation during adipogenesis (see Fig. 1b). Thus, HOTAIR displays active regulatory sites in ASCs, in line with the expression of specific isoforms. However, HOTAIR upregulation upon cell cycle arrest is not accompanied by changes of H3K4me3 and H3K27ac levels, and loss of H3K4me3 at region 4 occurs only at later differentiation time points (D3 and D9). Thus, modulation of HOTAIR levels during adipogenesis is not mediated by epigenetic regulation (Fig. 5d).

HOTAIR stability increases upon cell cycle arrest
HOTAIR half-life varies according to cell type, from 4 h in HeLa cells [13] to more than 7 h in primary trophoblast cells [48]. We therefore asked whether increased HOTAIR levels upon cell cycle arrest (D0; see Fig. 1b) could relate to a change in lncRNA stability. Treatment with Flavopiridol to inhibit RNA polymerase II (Pol II) transcription reveals an increase in HOTAIR half-life from 3 h in proliferating ASCs to > 4 h in D0 cells (Fig. 6a), while cell cycle arrest does not impact the stability of control short-lived CEBPD or long-lived GAPDH mRNA (Fig. 6b,c). Thus, while ATAC-seq data indicate that the HOTAIR isoform balance is likely regulated by cell type-specific transcription factors, the global increase in HOTAIR levels observed at D0 results at least in part from an increase in its stability.

Discussion
LncRNAs can modulate important biological and pathophysiological processes through their interaction with multiple partners. Our long-read Capture-seq of HOTAIR unveils a complex and dynamic pool of isoforms in differentiating ASCs. Adipogenic induction triggers a switch in the expression of main HOTAIR isoforms which likely impacts on its structure and interactome. Our results emphasize the importance of robust lncRNA annotation in the tissue of interest prior to functional characterization.
A single lncRNA locus can generate many transcript variants through alternative TSS usage, polyadenylation sites and splicing events [49]. We find that previously undescribed HOTAIR variants constitutes the major isoforms in human adipose stem cells, and show that not only lncRNA expression level, but also the isoform pool expressed can be highly cell type-specific, adding another layer of complexity to the regulation of biological processes by lncRNAs. The chromatin landscape of the HOTAIR locus in ASCs is consistent with a promoter upstream of HOTAIR exon E3.1, which is in an 'open' and active epigenetic configuration in ASCs and myoblasts, where HOTAIR E3.1 starting isoforms are expressed, but not in HeLa cells. However, adipogenic induction results in only mild and delayed changes in active histone modification at this site, suggesting that expression of various HOTAIR isoforms is rather regulated by differentiation-stage specific transcription factors. In line, alternative promoter usage is elicited downstream of adipogenic, but not osteogenic, signaling pathways. Additionally, increased HOTAIR levels upon cell cycle arrest (D0 cells) also correlates with an increase in its stability. Interestingly, interaction with the RNA-binding protein HuR, a negative regulator of adipogenesis [50], reduces HOTAIR stability [13]. Alternatively, increased HOTAIR stability could result from its binding to a cell stage-specific protein partner. Collectively, our results indicate a tight, multifactorial regulation of isoform pool during adipogenesis, arguing for a functional importance of the isoform switch.
In cancer cell lines, HOTAIR has been reported to scaffold for chromatin regulators with broad effects on gene expression [51][52][53]. Our long-read sequencing data reveal multiple polyA site usage leading to the exclusion of the LSD1 binding domain at the 3' end, or alternative splicing events affecting the PRC2 minimal binding domain. However, the proportion of HOTAIR isoforms  Fig.S7. c IGV tracks over HOTAIR locus showing predicted HOTAIR regulatory elements (REMs) from the EpiRegio database [42,43] (red: activating, blue: repressing), average ATAC-Seq tracks from ASCs [44,45], Myoblasts [46] and HeLa cells [47], and ChIP primers location. d ChIP-qPCR analysis of histone modifications during adipogenesis of ASCs (mean ± SEM of n ≥ 3 independent differentiation experiments; **p < 0.01 Mixed effect analysis) containing LSD1 or PRC2 domains does not vary during adipose differentiation, suggesting that HOTAIR's role in adipogenesis is independent from its epigenetic scaffold function. Supporting this idea, HOTAIR depletion does not significantly affect PRC2-mediated gene regulation in adipose progenitors [12]. Recent studies have confirmed that a non-protein-coding locus can give rise to functionally distinct transcript isoforms [49,[54][55][56]. Of note, the switch in HOTAIR start site upon differentiation induction leads to the inclusion of HOTAIR exon 3 containing a protein binding domain [19], which likely alters HOTAIR function. Another intriguing possibility is that short sequence variations at the 5' end impact HOTAIR's secondary structure and thus the folding of functional domains. Observation of HOTAIR structure using atomic force microscopy reveals multiple dynamic conformations [57]. It is therefore conceivable that HOTAIR functions via conformational changes, induced by or resulting from interactions with protein partners. Hence, variations in isoform composition likely results in cell type specific structures and interactomes, which could account for the divergent roles of HOTAIR in primary cells and cancer cell lines.

Conclusions
We generate the first cell type-specific, comprehensive catalog of HOTAIR isoforms in a physiological context and describe novel HOTAIR isoforms, alternative splicing events, and multiple start site usage. We find that HOTAIR splicing in ASCs often leads to the exclusion or truncation of canonical LSD1 and PRC2 binding domains. We uncover a shift in HOTAIR TSS usage that controls the balance of HOTAIR isoforms during early adipogenesis The variability of HOTAIR isoforms opens new perspectives for studies in (patho)physiological contexts.

Methods
All methods were performed in accordance with the guidelines and regulations of the University of Oslo.

RT-qPCR and semi-quantitative PCR
Total RNA was isolated using the RNeasy kit (QIA-GEN) and 1 µg was used for cDNA synthesis using the High-Capacity cDNA Reverse Transcription Kit (Ther-moFisher). RT-PCR was done using IQ SYBR green (Biorad) with SF3A1 as a reference gene. PCR conditions were 95 °C for 3 min and 40 cycles of 95 °C for 30 s, 60 °C for 30 s, and 72 °C for 20 s. Semi-quantitative PCR was done using a PCR Master Mix (Ther-moFisher) with the following conditions: 95 °C for 3 min and 30 cycles of 95 °C for 30 s, 60 °C for 30 s, and 72 °C for 30 s. Products were separated in a 2.5% agarose gel with Tris-Borate-EDTA buffer. PCR primers are listed in Additional file 2 Table S5. Uncropped gels are presented in Additional file 4, Fig. S4 to S7.
Short-read Illumina RNA-sequencing and data analysis.
Differentiation time courses from ASC-1 with 5 time points were sequenced in biological triplicates with short (40 bp), paired end reads on Illumina NextSeq. Reads were aligned to the hg38 genome (ensembl v95 annotation) using hisat2, and counted with feature-Counts (-fraction -M). To further analyze differentially expressed genes, edgeR and limma packages were used. Low abundance genes were filtered using edgeR's function "filterByExpr". Genes with an adjusted p-value < 0.01 (eBayes method, limma package) between two or more consecutive time points were clustered with DPGP software. FDR adjusted p-value for Adipogenesis cluster was generated by overrepresentation analysis using Hallmark gene sets from MSigDb (v7.4) with all human genes as background. FeatureCounts with strict parameters (-f -g exon_id -p -s 2 -O -fraction -B -C) was used to quantify exon coverage as normalized counts per kilobase of exon sequence.

PacBio capture-seq
PacBio Capture-seq was performed on duplicate differentiation time courses from ASC-1. For each differentiation, RNA samples from 5 differentiation time points were used to synthesize full-length barcoded cDNA libraries using the Template Switching RT Enzyme Mix (NEB). Libraries were prepared using Pacific Biosciences protocol for cDNA Sequence Capture Using IDT xGen ® Lockdown ® probes (https:// eu. idtdna. com/ site/ order/ ngs). A pool of 100 probes against all known HOTAIR isoform sequences was designed using the IDT web tool. Full-length cDNA was cleaned up using Pronex beads, and 1200 ng was used for each hybridization reactions. Library was sequenced in one 8 M SMRT cell on a Sequel II instrument using Sequel II Binding kit 2.1 and Sequencing chemistry v2.0.

Transcript identification from targeted long-read sequencing
CCS sequences were generated for the entire dataset using the Circular Consensus Sequence pipeline (SMRT Tools v 8.0.0.80502) with minimum number of passes 3 and minimum accuracy 0.99. CCS reads were demultiplexed using the Barcoding pipeline (SMRT Tools v7.0.0.63823). Iso-Seq analysis was performed using the Iso-Seq pipeline (SMRT Link v7.0.0.63985) with default settings. Only clustered isoforms with at least 2 subreads, 0.99 quality score and containing a polyA tail of at least 20 base pairs (bp) were used. Isoforms were aligned to the hg38 genome with minimap2 v.2.17 [59]. Primary alignments to the HOTAIR locus (chr12: 53,962,308-53,974,956) were selected as target HOTAIR transcripts if they had a mapping quality above 20 and less than 50 clipped nucleotides (samclip; https:// github. com/ tseem ann/ samcl ip). Single exon and sense transcripts were also filtered.
HOTAIR transcripts were collapsed with both Cupcake Tofu [31] (https:// github. com/ Magdo ll/ cDNA_ Cupca ke) and TAMA [33] (https:// github. com/ Genom eRIK/ tama). Parameters -dun-merge-5-shorter and -x capped were used for cupcake and TAMA respectively, to prevent shorter transcript models from being merged into longer ones. For TAMA, -z 100 was also set to increase the allowed 3' variability. To combine transcript lists between timepoints, collapsed transcripts with at least 50 full length reads within one sample were merged using cupcake chain_samples.py or tama_merge.py. Initially this produced ~ 80 HOTAIR transcripts, with much of the variation in the ends of 3' and 5' exons. To achieve a final transcript list, collapsed transcripts were merged on internal junctions only by increasing the allowed 5' and 3' variability (with TAMA options -a 300 -z 2000) and ranking libraries by the number of polyadenylated HOTAIR reads. This resulted in 34 TAMA isoforms.
Final cross-validation was conducted by running SQANTI with unclustered ccs reads as the "novel long read-defined transcriptome" and the top 34 TAMA isoforms as the "reference annotation". This assigned full length reads to their corresponding TAMA isoform and reads annotated as full-splice match were counted for each isoform. SQANTI was used again to classify transcripts against existing reference annotations and to search for polyA motifs near transcript ends from a list of potential human motifs (Additional file 2, Tables S1, S2) [34] (https:// github. com/ Cones aLab/ SQANTI). Reference isoforms were collected from four sources: Ensembl (v95), RefSeq, UCSC browser's "lincRNA and TUCP transcripts" and Fantom CAT (Hon 2015), the last of which included isoforms from encode, hubmap and miTranscriptome (Additional File 1 Fig. S2) SQANTI categories are defined relative to the reference isoforms as follows (i) FSM: the number of exons and all internal junctions are concordant with the reference isoform. (ii) ISM: the isoform has less 5' exons than the reference, and yet the internal junctions are perfectly consistent. (iii) NIC: all donor and acceptor sites exist in the reference isoform list but their combination in a single isoform is novel. (iv) NNC: at least one of the donor or acceptor sites in the isoform is not found in the reference list. For all categories, the exact length of the 5' and 3' ends of first and last exons, respectively, can differ by any amount.

Snaptron
We searched Snaptron's SRA V2 database [41], which contains ~ 49,000 public samples, for experiments with reads overlapping HOTAIR exon-exon junctions with the web query http:// snapt ron. cs. jhu. edu/ srav2/ snapt ron? regio ns= HOTAI R& rfilt er= annot ated:1 (Additional File 2, Table S3). Sample IDs for experiments with at least 10 exon spanning reads were extracted and cell type information accessed via a matching metadata query. A simplified version of the exon E3.1 metadata table is presented in Additional file 2 Table S4.